CAT-LM

Official release of CAT-LM: Aligned Code And Tests Language Model.

Overview

CAT-LM is a GPT-style language model with 2.7 billion parameters, trained on a corpus of Python and Java projects. We utilize a novel pretraining signal that explicitly considers the mapping between code and test files when available. We also drastically increase the maximum sequence length of inputs to 8,192 tokens, 4x more than typical code generation models, to ensure that the code context is available to the model when generating test code. Our work highlights the importance of incorporating software-specific insights when training language models for code and paves the way to more powerful automated test generation.
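
To illustrate the code–test alignment signal, here is a minimal, hypothetical sketch of how a code file and its matching test file might be serialized into a single training sequence using the <|codetestpair|> separator that also appears in the usage example below. The helper name and file contents are illustrative only, not the actual data pipeline.

def build_training_example(code_text, test_text):
    """Join a code file and its aligned test file with the separator token."""
    return code_text + "\n<|codetestpair|>\n" + test_text

code_file = "def add(x, y):\n    return x + y\n"
test_file = "def test_add():\n    assert add(1, 2) == 3\n"

print(build_training_example(code_file, test_file))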

Publication

CAT-LM: Training Language Models on Aligned Code And Tests
Nikitha Rao*, Kush Jain*, Uri Alon, Claire Le Goues, and Vincent J. Hellendoorn
38th IEEE/ACM International Conference on Automated Software Engineering (ASE 2023)

Usage

CAT-LM is available on Hugging Face as nikitharao/catlm and can be loaded with the transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('nikitharao/catlm', use_fast = False)
model = AutoModelForCausalLM.from_pretrained('nikitharao/catlm')

prompt = """
def add(x,y):
    \"\"\"Add two numbers x and y\"\"\"
    return x+y
<|codetestpair|>
"""

print('Input prompt:')
print(prompt)
       
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# The model was trained without the `</s>` token, so remove it if the tokenizer appended one.
if tokenizer.decode(input_ids[0,-1]) == '</s>':
    input_ids = input_ids[:,:-1]

print(input_ids)
len_input = input_ids.shape[1]

sample_output = model.generate(
    input_ids,
    do_sample=True, 
    max_new_tokens = 512,
    top_k=50, 
    top_p=0.95,
    temperature=0.2
)
generated_output = sample_output[0][len_input:]
output = tokenizer.decode(generated_output, skip_special_tokens=True)
print('Output:')
print(output)

Note: The model was trained without the </s> token, so it should be removed from the tokenized input if the tokenizer appends it (as done in the snippet above).
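
With 2.7 billion parameters, the model can be slow to run on CPU. Below is a minimal sketch, assuming a CUDA GPU is available, that loads the model in half precision and runs the same generation on the GPU; the prompt and sampling settings mirror the example above and are not prescriptive.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('nikitharao/catlm', use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    'nikitharao/catlm',
    torch_dtype=torch.float16,  # half precision to reduce GPU memory use
).to('cuda')

prompt = 'def add(x, y):\n    return x + y\n<|codetestpair|>\n'
input_ids = tokenizer(prompt, return_tensors='pt').input_ids

# Drop a trailing </s> if the tokenizer appended one (see note above).
if tokenizer.decode(input_ids[0, -1]) == '</s>':
    input_ids = input_ids[:, :-1]

with torch.no_grad():
    sample_output = model.generate(
        input_ids.to('cuda'),
        do_sample=True,
        max_new_tokens=512,
        top_k=50,
        top_p=0.95,
        temperature=0.2,
    )

print(tokenizer.decode(sample_output[0][input_ids.shape[1]:], skip_special_tokens=True))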

Data and Model Training

The code and datasets for training and evaluating CAT-LM, along with the results of additional experiments and comparisons with TeCo, CodeGen, and StarCoder, are available at:

https://doi.org/10.5281/zenodo.7901830
